Reputation-based Contents Crawling in Web Archiving System
Author
Abstract
The size of the web archive is increasing exponentially, and many national libraries are making efforts to preserve born-digital scientific, artistic, and cultural content. However, crawling and storing such a huge volume of digital information raises problems that are hard to resolve from social, legal, and technical viewpoints. In this paper, with a view to the long-term preservation of digital content with a good reputation for trustworthiness, uniqueness, and value, we discuss strategies for preserving the monotonically growing digital content on web servers. Experimental results show that our reputation model makes it possible to crawl socially valuable content for archiving.
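As an illustration only (the abstract does not give the paper's actual scoring formula), a reputation-based crawl filter might combine per-page trustworthiness, uniqueness, and valuation scores into a single reputation value and archive only pages above a threshold. The weights, threshold, and URLs below are hypothetical stand-ins:

```python
# Hypothetical weights for the three reputation components named in the
# abstract; the paper's real model and parameters are not specified here.
WEIGHTS = {"trustworthiness": 0.5, "uniqueness": 0.3, "valuation": 0.2}

def reputation(scores: dict) -> float:
    """Weighted sum of per-page reputation components, each assumed in [0, 1]."""
    return sum(WEIGHTS[k] * scores.get(k, 0.0) for k in WEIGHTS)

def select_for_archiving(pages: dict, threshold: float = 0.6) -> list:
    """Return URLs whose combined reputation meets the archiving threshold,
    best-scoring first."""
    return sorted(
        (url for url, s in pages.items() if reputation(s) >= threshold),
        key=lambda url: -reputation(pages[url]),
    )

pages = {
    "http://a.example/": {"trustworthiness": 0.9, "uniqueness": 0.8, "valuation": 0.7},
    "http://b.example/": {"trustworthiness": 0.2, "uniqueness": 0.4, "valuation": 0.1},
}
print(select_for_archiving(pages))  # only a.example clears the threshold
```

The point of such a filter is that crawl capacity goes first to pages the model judges socially valuable, rather than being spent blindly on everything reachable.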
Similar Articles
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, downloading only domain-specific web pages is not a simple task, and an unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...
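The URL-queue prioritization this snippet describes can be sketched as a scored frontier: URLs judged more topic-relevant are dequeued first. The toy relevance function, topic terms, and URLs below are invented for illustration; a real focused crawler would use a trained topic classifier rather than keyword overlap:

```python
import heapq

class FocusedFrontier:
    """A priority-ordered URL queue for focused crawling (illustrative sketch)."""

    def __init__(self, topic_terms):
        self.topic_terms = set(topic_terms)
        self._heap = []
        self._seen = set()

    def relevance(self, url: str, anchor_text: str) -> float:
        # Toy scorer: fraction of topic terms that appear in the anchor text.
        words = set(anchor_text.lower().split())
        return len(self.topic_terms & words) / len(self.topic_terms)

    def push(self, url: str, anchor_text: str):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so negate the score for highest-first order.
            heapq.heappush(self._heap, (-self.relevance(url, anchor_text), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[1]

f = FocusedFrontier(["web", "archiving"])
f.push("http://x.example/cats", "cute cat pictures")
f.push("http://y.example/wa", "web archiving tools")
print(f.pop())  # the on-topic link is crawled first
```

Ordering the frontier this way is what lets a focused crawler spend its bandwidth on the target topic instead of drifting across the whole web.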
Intelligent and Adaptive Crawling of Web Applications for Web Archiving
Web sites are dynamic in nature, with content and structure changing over time. Many pages on the Web are produced by content management systems (CMSs) such as WordPress, vBulletin, or phpBB. Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding the CMS the site is based on (leading to suboptimal crawling strategies) and whatever...
متن کاملARCOMEM Crawling Architecture
The World Wide Web is the largest information repository available today. However, this information is very volatile and Web archiving is essential to preserve it for the future. Existing approaches to Web archiving are based on simple definitions of the scope of Web pages to crawl and are limited to basic interactions with Web servers. The aim of the ARCOMEM project is to overcome these limita...
Intelligent Event Focused Crawling
There is a need for an integrated event-focused crawling system to collect Web data about key events. When an event occurs, many users try to locate the most up-to-date information about it, yet there is little systematic collection and archiving of information about events. We propose intelligent event-focused crawling for automatic event tracking and archiving, as well as effec...
Building and archiving event web collections: A focused crawler approach
In this paper, we present a new approach for building and archiving web collections about events. Our approach combines the traditional focused crawling technique with event modeling and representation.
Publication date: 2008